Skip to content

Introducing a new, hierarchical amdgpu_family_matrix layout#3248

Merged
HereThereBeDragons merged 7 commits into
mainfrom
users/lpromber/refactor_amdgpu_matrix
Feb 12, 2026
Merged

Introducing a new, hierarchical amdgpu_family_matrix layout#3248
HereThereBeDragons merged 7 commits into
mainfrom
users/lpromber/refactor_amdgpu_matrix

Conversation

@HereThereBeDragons
Copy link
Copy Markdown
Contributor

@HereThereBeDragons HereThereBeDragons commented Feb 4, 2026

This PR is part of enabling CI weekly (big picture PR #1732 ) . For this, refactoring of amdgpu_family is needed for easier selection of a specific gpu, and not having to rely on knowing in which group the specific gpu is part of (presubmit/postsubmit/nightlies).

New layout puts all gpus in a single dictionary amdgpu_family_info_matrix_all. Choices for pre/postsubmit and nightlies are now done via a list.

amdgpu_family_info_matrix_all has more depth in hierarchy to better define which parameters belong to which step: build, test, release.

  • Tests can be now turned off with a single bool flag. no need to comment out the runner
  • Default runner labels are "test", "benchmark" and "test-runs-on-multi-gpu". Further labels can be introduced, e.g. "oem" which in the future can overwrite the default runners.

This new layout is introduced in parallel to the old one. The old one stays unchanged to allow gradual move for the CI to use the new layout.

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

To review this you need to compare it with: build_tools/github_actions/amdgpu_family_matrix.py
(did another check of that it matches today, 4ht Feb)

I added some more reviewers just for awareness of these changes.

Copy link
Copy Markdown
Contributor

@jayhawk-commits jayhawk-commits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through the new matrix format and I am liking many of the changes.

I like adding new boolean fields for control of things like test enablement, so we don't have to delete runner group names temporarily for disabling tests.

Clear separation of definitions between build, test, and release is great.

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

@jayhawk-commits
Copy link
Copy Markdown
Contributor

What will become of the load_test_runner_from_gh_variables and get_all_families_for_trigger_types functions from the original matrix?

Copy link
Copy Markdown
Contributor

@geomin12 geomin12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good! just a few housekeeping comments

Comment thread build_tools/github_actions/amdgpu_family_matrix.py
Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
"""

# blueprint = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably redundant as Data layout comment covers this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i left it as easy copy and paste if you need to add new archs. could maybe put it at the end of the file and add a note to the front?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(So I would suggest deleting this comment)

Comment thread build_tools/github_actions/new_amdgpu_family_matrix.py Outdated
},
}
},
"gfx115x": {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some logic to make these keys lowercase (less mistake prone when declaring gpu families). should we make all these keys consistent and lowercase?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake

}


amdgpu_family_info_matrix_all = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit and this is just an opinion (don't need to implement): can we order these to match amdgpu_family_predefined_groups's order?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it started like this in the beginning... if moving i would rather move it into numeric ordering? what do you think?

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

HereThereBeDragons commented Feb 5, 2026

What will become of the load_test_runner_from_gh_variables and get_all_families_for_trigger_types functions from the original matrix?

load_test_runner_from_gh_variables will be in the generator file e.g. look in get_github_event_args() in

if os.environ.get("LOAD_TEST_RUNNERS_FROM_VAR", "false") == "true":
test_runner_json_str = os.getenv("ROCM_THEROCK_TEST_RUNNERS", "{}")
github_event_args["orgwide_test_runner_dict"] = json.loads(test_runner_json_str)
else:
github_event_args["orgwide_test_runner_dict"] = {}

get_all_families_for_trigger_types is not needed anymore. the generator will work on the entire arch dict and do the splitting based on either the predefined groups and/or the archs manually requested

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

yes the generator will have tests for it. see for current version ( #1732 ): https://github.com/ROCm/TheRock/blob/89c5e71a4ea42b537fe76572035ecd7f719a46b8/build_tools/github_actions/tests/configure_amdgpu_matrix_test.py

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

yeah a better overview would be nice. but definitely needs another PR for it. i also thought about splitting the file per group of families in case we get more archs.

@jayhawk-commits
Copy link
Copy Markdown
Contributor

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

yes the generator will have tests for it. see for current version ( #1732 ): https://github.com/ROCm/TheRock/blob/89c5e71a4ea42b537fe76572035ecd7f719a46b8/build_tools/github_actions/tests/configure_amdgpu_matrix_test.py

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

yeah a better overview would be nice. but definitely needs another PR for it. i also thought about splitting the file per group of families in case we get more archs.

Testing being done in the generator sounds good to me. Testing can add coverage for some of the nits from this PR like lower-case and ordering of items, which I don't have a strong opinion on.

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

i guess we can ignore those CI failures as the files added here are not used anywhere yet?

@jayhawk-commits
Copy link
Copy Markdown
Contributor

i guess we can ignore those CI failures as the files added here are not used anywhere yet?

Agreed

Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
"""

# blueprint = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.

Comment on lines +102 to +104
amdgpu_family_predefined_groups = {
# The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).
"amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos here? - and consistent uppercase or lowercase of X?

Suggested change
amdgpu_family_predefined_groups = {
# The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).
"amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"],
amdgpu_family_predefined_groups = {
# The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).
"amdgpu_presubmit": ["gfx94X-dcgpu", "gfx110x-all", "gfx1151", "gfx120X-all"],

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... i had it capitalized and then let myself convince from geo that lower case is also not a bad choice. i switch it back to "X". matching to the cmake file is definitely what i had on my mind in the original design ... which i then forgot over the last couple of months :(

Comment thread build_tools/github_actions/new_amdgpu_family_matrix.py
},
}
},
"gfx115x": {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake

Comment on lines +484 to +485
"gfx103x": {
"dgpu": {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the nesting here. Is this planning towards having e.g.

"gfx103x": {
  "all": { },
  "dgpu": { },
  "igpu": { }
}

?

If we only ever have one at a time, it would be simpler to collapse down a level to just

"gfx103X-all": { },
"gfx103X-dgpu": { },
"gfx103X-igpu": { }

There there wouldn't an extra level of nesting between

    "gfx115x": {
        "gfx1153": {     // <--- single target
            "linux": {
                "build": {
                    "expect_failure": True,
                    "build_variants": ["release"],
                },

    "gfx90x": {
        "dcgpu": {    // <--- family grouping
            "linux": {
                "build": {
                    "build_variants": ["release"],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments like this and my previous thought on converting the contents of the matrix in an easier to read format has me thinking that maybe it is worth exploring the reverse. We could define the matrix in format (.yml or ,json for example) that is easier to read and maintain, and then script to consume and test its contents for things we want to standardize like capitalization.

Comment on lines +145 to +146
amdgpu_family_info_matrix_all = {
"gfx94x": {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would be easier to maintain and edit if we refactored like so:

amdgpu_family_gfx94X = {
  "dcgpu": {  ... }
}

amdgpu_family_gfx110X = {
  "dcgpu": {  ... }
}

amdgpu_family_info_matrix_all = {
  "gfx94X": amdgpu_family_gfx94X,
  "gfx110X": amdgpu_family_gfx110X,
  ...
}

It would be possible to add a family and then forget to "register" it in the _all list though.

I had some similar thoughts back on #1097 and #1350 .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the nesting here.

my idea of the nesting was that everything below "gfx103x" can either be group or arch that are part of "gfx103". just anything matching "gfx103" and adding on demand when needed for the CI.

if we would want to run "igpu", or just the single gpu target "gfx1030" or "gfx1032" we would have the structure like this:

"gfx103x": {
  "all": { },
  "dgpu": { },
  "igpu": { },
  "gfx1030": { }.
   "gfx1032": { }
}

Copy link
Copy Markdown
Contributor Author

@HereThereBeDragons HereThereBeDragons Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about it having separate files for better editing:
we could have

amdgpu_family_matrix.py
amdgpu_family_gfx103x.py
amdgpu_family_gfx950.py
amdgpu_family_gfx94x.py
...

in which we do the same trick as we do for the pytorch skip testing:

  • all amdgpu_family_gfx<family>.py have the same dictionary name: amdgpu_family_info
  • amdgpu_family_matrix.py builds amdgpu_family_info_matrix_all by importing all amdgpu_family_gfx*.py files and adding the specific amdgpu_family_info to amdgpu_family_info_matrix_all

structure within amdgpu_family_gfx<family>.py would be:

amdgpu_family_info = {
  "gfx103x": {
    "all": { },
    "dgpu": { },
    "igpu": { },
    "gfx1030": { }.
     "gfx1032": { }
  }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe having a short call about it would be good?

Copy link
Copy Markdown
Contributor Author

@HereThereBeDragons HereThereBeDragons Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as an addition. this is the current plan in my head:

  • new structure for amdgpu matrix to know what belongs to build, test, release and be independent of if arch belongs to presubmit, postsubmit or nightly_ci
  • new configure_ci ( configure_amdgpu_matrix.py ) that allows for default values for each configuration step and removes entire matrices that are not required (e.g. no testing wanted = no test section in the matrix returned)
  • new setup.yml that propagates the container image to only have to set the image at a single place
  • add weekly ci
  • starting to move over the other cis to use the new layout (should be easy as the only the toplevel workflow elements need to be changed)
  • add multi arch generator in configure_amdgpu_matrix.py
  • looking then if and how the rest of the configurators (fetch_.. , configure_...,) etc can be integrated into configure_amdgpu_matrix.py
  • finish moving all cis to the new format

Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
"""

# blueprint = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(So I would suggest deleting this comment)

Comment thread build_tools/github_actions/amdgpu_family_matrix.py
},
"sanity_check_only_for_family": True,
},
"release": {"push_on_success": True, "bypass_tests_for_releases": True},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a trailing comma after bypass_tests_for_releases so this wraps like the other entries

Suggested change
"release": {"push_on_success": True, "bypass_tests_for_releases": True},
"release": {
"push_on_success": True,
"bypass_tests_for_releases": True,
},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i blame pre commit hook :-P

Comment thread build_tools/github_actions/new_amdgpu_family_matrix.py Outdated
Comment thread build_tools/github_actions/new_amdgpu_family_matrix.py Outdated
Comment thread build_tools/github_actions/new_amdgpu_family_matrix.py Outdated
@HereThereBeDragons HereThereBeDragons merged commit fc87910 into main Feb 12, 2026
15 of 16 checks passed
@HereThereBeDragons HereThereBeDragons deleted the users/lpromber/refactor_amdgpu_matrix branch February 12, 2026 22:03
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Feb 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants